Goto

Collaborating Authors

 binary matrix


ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

Neural Information Processing Systems

Large language models (LLMs) have shown impressive performance on language tasks but face challenges when deployed on resource-constrained devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and latency bottlenecks. Shift-and-add reparameterization offers a promising solution by replacing costly multiplications with hardware-friendly primitives in both the attention and multi-layer perceptron (MLP) layers of an LLM. However, current reparameterization techniques require training from scratch or full parameter fine-tuning to restore accuracy, which is resource-intensive for LLMs. To address this, we propose accelerating pretrained LLMs through post-training shift-and-add reparameterization, creating efficient multiplication-free models, dubbed ShiftAddLLM. Specifically, we quantize each weight matrix into binary matrices paired with group-wise scaling factors. The associated multiplications are reparameterized into (1) shifts between activations and scaling factors and (2) queries and adds according to the binary matrices. To reduce accuracy loss, we present a multi-objective optimization method to minimize both weight and output activation reparameterization errors. Additionally, based on varying sensitivity across layers to reparameterization, we develop an automated bit allocation strategy to further reduce memory usage and latency. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity reductions of 5.6 and 22.7 points at comparable or lower latency compared to the most competitive quantized LLMs at 3-and 2-bit precision, respectively, and more than 80% memory and energy reductions over the original LLMs.


Profile Generators: A Link between the Narrative and the Binary Matrix Representation

Kutil, Raoul H., Zimmermann, Georg, Strasser-Kirchweger, Barbara, Borgelt, Christian

arXiv.org Artificial Intelligence

Mental health disorders, particularly cognitive disorders defined by deficits in cognitive abilities, are described in detail in the DSM-5, which includes definitions and examples of signs and symptoms. A simplified, machine-actionable representation was developed to assess the similarity and separability of these disorders, but it is not suited for the most complex cases. Generating or applying a full binary matrix for similarity calculations is infeasible due to the vast number of symptom combinations. This research develops an alternative representation that links the narrative form of the DSM-5 with the binary matrix representation and enables automated generation of valid symptom combinations. Using a strict pre-defined format of lists, sets, and numbers with slight variations, complex diagnostic pathways involving numerous symptom combinations can be represented. This format, called the symptom profile generator (or simply generator), provides a readable, adaptable, and comprehensive alternative to a binary matrix while enabling easy generation of symptom combinations (profiles). Cognitive disorders, which typically involve multiple diagnostic criteria with several symptoms, can thus be expressed as lists of generators. Representing several psychotic disorders in generator form and generating all symptom combinations showed that matrix representations of complex disorders become too large to manage. The MPCS (maximum pairwise cosine similarity) algorithm cannot handle matrices of this size, prompting the development of a profile reduction method using targeted generator manipulation to find specific MPCS values between disorders. The generators allow easier creation of binary representations for large matrices and make it possible to calculate specific MPCS cases between complex disorders through conditional generators.


Efficient Generation of Binary Magic Squares

Riou, Alain

arXiv.org Artificial Intelligence

We propose a simple algorithm for generating Binary Magic Squares (BMS), i.e., square binary matrices where the sum of all rows and all columns are equal. We show by induction that our algorithm always returns valid BMS with optimal theoretical complexity. We then extend our study to non-square Binary Magic Squares, formalize conditions on the sum of rows and columns for these BMS to exist, and show that a slight variant of our first algorithm can generate provably generate them. Finally, we publicly release two implementations of our algorithm as Python packages, including one that can generate several BMS in parallel using GPU acceleration.


Binary Quadratic Quantization: Beyond First-Order Quantization for Real-Valued Matrix Compression

Kuroki, Kyo, Okoshi, Yasuyuki, Van Chu, Thiem, Kawamura, Kazushi, Motomura, Masato

arXiv.org Artificial Intelligence

This paper proposes a novel matrix quantization method, Binary Quadratic Quantization (BQQ). In contrast to conventional first-order quantization approaches, such as uniform quantization and binary coding quantization, that approximate real-valued matrices via linear combinations of binary bases, BQQ leverages the expressive power of binary quadratic expressions while maintaining an extremely compact data format. We validate our approach with two experiments: a matrix compression benchmark and post-training quantization (PTQ) on pretrained Vision Transformer-based models. Experimental results demonstrate that BQQ consistently achieves a superior trade-off between memory efficiency and reconstruction error than conventional methods for compressing diverse matrix data. It also delivers strong PTQ performance, even though we neither target state-of-the-art PTQ accuracy under tight memory constraints nor rely on PTQ-specific binary matrix optimization. For example, our proposed method outperforms the state-of-the-art PTQ method by up to 2.2\% and 59.1% on the ImageNet dataset under the calibration-based and data-free scenarios, respectively, with quantization equivalent to 2 bits. These findings highlight the surprising effectiveness of binary quadratic expressions for efficient matrix approximation and neural network compression.



Hashing for Fast Pattern Set Selection

Karjalainen, Maiju, Miettinen, Pauli

arXiv.org Artificial Intelligence

Pattern set mining, which is the task of finding a good set of patterns instead of all patterns, is a fundamental problem in data mining. Many different definitions of what constitutes a good set have been proposed in recent years. In this paper, we consider the reconstruction error as a proxy measure for the goodness of the set, and concentrate on the adjacent problem of how to find a good set efficiently. We propose a method based on bottom-k hashing for efficiently selecting the set and extend the method for the common case where the patterns might only appear in approximate form in the data. Our approach has applications in tiling databases, Boolean matrix factorization, and redescription mining, among others. We show that our hashing-based approach is significantly faster than the standard greedy algorithm while obtaining almost equally good results in both synthetic and real-world data sets.


ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

Neural Information Processing Systems

Large language models (LLMs) have shown impressive performance on language tasks but face challenges when deployed on resource-constrained devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and latency bottlenecks. Shift-and-add reparameterization offers a promising solution by replacing costly multiplications with hardware-friendly primitives in both the attention and multi-layer perceptron (MLP) layers of an LLM. However, current reparameterization techniques require training from scratch or full parameter fine-tuning to restore accuracy, which is resource-intensive for LLMs. To address this, we propose accelerating pretrained LLMs through post-training shift-and-add reparameterization, creating efficient multiplication-free models, dubbed ShiftAddLLM. Specifically, we quantize each weight matrix into binary matrices paired with group-wise scaling factors. The associated multiplications are reparameterized into (1) shifts between activations and scaling factors and (2) queries and adds according to the binary matrices.


An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks

Dehghankar, Mohsen, Erfanian, Mahdi, Asudeh, Abolfazl

arXiv.org Artificial Intelligence

Despite their tremendous success and versatility, Large Language Models (LLMs) suffer from inference inefficiency while relying on advanced computational infrastructure. To address these challenges and make LLMs more accessible and cost-effective, in this paper, we propose algorithms to improve the inference time and memory efficiency of 1.58-bit LLMs with ternary weight matrices. Particularly focusing on matrix multiplication as the bottle-neck operation of inference, we observe that, once trained, the weight matrices of a model no longer change. This allows us to preprocess these matrices and create indices that help reduce the storage requirements by a logarithmic factor while enabling our efficient inference algorithms. Specifically, for a $n$ by $n$ weight matrix, our efficient algorithm guarantees a time complexity of $O(\frac{n^2}{\log n})$, a logarithmic factor improvement over the standard $O(n^2)$ vector-matrix multiplication. Besides theoretical analysis, we conduct extensive experiments to evaluate the practical efficiency of our algorithms. Our results confirm the superiority of the approach both with respect to time and memory, as we observed a reduction in inference time up to 29x and memory usage up to 6x.


Inferring Interaction Networks using the IBP applied to microRNA Target Prediction

Neural Information Processing Systems

Determining interactions between entities and the overall organization and clustering of nodes in networks is a major challenge when analyzing biological and social network data. Here we extend the Indian Buffet Process (IBP), a nonparametric Bayesian model, to integrate noisy interaction scores with properties of individual entities for inferring interaction networks and clustering nodes within these networks. We present an application of this method to study how microR-NAs regulate mRNAs in cells. Analysis of synthetic and real data indicates that the method improves upon prior methods, correctly recovers interactions and clusters, and provides accurate biological predictions.


b4d168b48157c623fbd095b4a565b5bb-Paper.pdf

Neural Information Processing Systems

Numerous datasets ranging from group memberships within social networks to purchase histories on e-commerce sites are represented by binary matrices. While this data is often either proprietary or sensitive, aggregated data, notably row and column marginals, is often viewed as much less sensitive, and may be furnished for analysis. Here, we investigate how these data can be exploited to make inferences about the underlying matrix H. Instead of assuming a generative model for H, we view the input marginals as constraints on the dataspace of possible realizations of H and compute the probability density function of particular entries H(i, j) of interest. We do this for all the cells of H simultaneously, without generating realizations, but rather via implicitly sampling the datasets that satisfy the input marginals. The end result is an efficient algorithm with asymptotic running time the same as that required by standard sampling techniques to generate a single dataset from the same dataspace. Our experimental evaluation demonstrates the efficiency and the efficacy of our framework in multiple settings.